An introduction to the tidyverse

The contents of this document illustrate the basic use and advantages of the tidyverse suite of packages for data organization and management. The contents and examples were adapted from Ameztegui and Granda (2019).

The tidyverse

The tidyverse is a collection of R packages designed for data science, as a suite aimed at easening the data analysis in all its steps.

Created by Hadley Wickham, chief scientist of RStudio, and author of more than 30 R packages (readr, ggplot2, plyr, devtools, roxygen2, rmarkdown…). All packages share an underlying design philosophy, grammar, and data structures.

So what’s exactly in the tidyverse?

  • ggplot2 a system for creating graphics, based on the Grammar of Graphics

  • readr a fast and friendly way to read rectangular data (csv, txt…)

  • tibble a tibble is a re-imagining version of the data frame, keeping what time has proven to be effective and throwing out what has not

  • stringr provides a cohesive set of functions designed to make working with strings as easy as possible

  • forcats provides a suite of useful tools that solve common problems with factors

  • dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges

  • tidyr provides a set of functions that help you get to tidy data

  • purrr enhances functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors

Installation and use

  • Install all the packages in the tidyverse by running install.packages("tidyverse")

  • Run library(tidyverse) to load the core tidyverse and make it available in your current R session.
  • Learn more about the tidyverse package at http://tidyverse.tidyverse.org.
  • Or check the cheatsheets

Before we start…

  • Neither tidyr, nor dplyr or purrr do anything that can’t be done with base R code, apply family functions, for loops or other packages.

  • Designed to be more efficient (in time), easier to read and easier to use. More intuitive to use, specially for beginners (it may require some adaptation if you are used to base R code).

  • Valid mostly for data.frames. For other formats (matrices, arrays) plyr can be used.

Our data

  1. plots [11858 x 15]: all plots from the Third Spanish Forest Inventory (IFN3) in Catalonia

  2. trees [111756 x 12]: all trees with dbh > 7.5 cm measured in both IFN2 and IFN3

  3. species [14778 x 15]: number of trees per hectare in each plot, by species and size class

  4. coordinates [11858 x 6]: X and Y UTM coordinates of each plot.

  5. leaf [10447 x 3]: leaf biomass and carbon content for those IFN3 plots where they were available

A brief introduction of data managment with dplyr

Let’s have a look at the data

Data inspection must be the first thing we do before starting any kind of analysis.

trees
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7  8.3   31.8     20    20.3    18.9    9  
##  2 080002 08        476        38  9.1   31.8     35    34      32.4    9  
##  3 080003 08        021        25  7     31.8     25    24.8    17.6   11  
##  4 080004 08        021        28  8.89  31.8     15    16.8    12.6    9.5
##  5 080006 08        021        19 11.2   14.1     35    34.0    30.9   13  
##  6 080007 08        021        32 12     14.1     35    33.1    28.2   10  
##  7 080008 08        243        40  7.8   31.8     15    15      13.2    6  
##  8 080009 08        045        16  5.09  31.8     20    17.5    15.3    7  
##  9 080010 08        243        47 26.9    5.09    65    67.4    66.8   16.5
## 10 080013 08        022        44  2.7  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows
plots
## # A tibble: 11,858 x 15
##    Codi  Provincia Cla   Subclase FccTot FccArb FechaIni   HoraIni            
##    <fct> <chr>     <fct> <fct>     <int>  <int> <date>     <dttm>             
##  1 0800~ 08        A     1            80     70 2001-07-09 2017-11-26 09:44:00
##  2 0800~ 08        A     1            80     70 2001-08-06 2017-11-26 09:18:58
##  3 0800~ 08        A     1            90     80 2001-08-06 2017-11-26 12:08:09
##  4 0800~ 08        A     1            90     50 2001-07-09 2017-11-26 13:23:23
##  5 0800~ 08        A     1            70     60 2001-08-03 2017-11-26 09:11:28
##  6 0800~ 08        A     1            90     90 2001-08-01 2017-11-26 13:00:33
##  7 0800~ 08        A     1            90     90 2001-08-07 2017-11-26 10:08:15
##  8 0800~ 08        A     1            70     60 2001-08-03 2017-11-26 12:12:03
##  9 0800~ 08        A     1            80     70 2001-08-02 2017-11-26 09:00:16
## 10 0800~ 08        A     1            80     80 2001-06-14 2017-11-26 12:34:21
## # ... with 11,848 more rows, and 7 more variables: FechaFin <date>, HoraFin <dttm>,
## #   Rocosid <int>, Textura <int>, MatOrg <int>, PhSuelo <int>, FechaPh <date>
species
## Warning: Detecting old grouped_df format, replacing `vars` attribute by `groups`
## # A tibble: 14,778 x 15
## # Groups:   Codi, Especie [14,778]
##    Codi   Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60
##  * <fct>  <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 080001 022         0 159.   31.8 111.  60.1   19.2  5.09   0    0        0   0  
##  2 080002 021         0   0     0     0    0     74.2 28.3   63.7  0        0   0  
##  3 080002 022         0   0     0   173.  31.8    0    0      0    0        0   0  
##  4 080002 476         0   0     0     0    0     31.8  0      0    0        0   0  
##  5 080003 021         0   0     0    31.8  0      0    0      0    5.09     0   0  
##  6 080003 022         0 127.    0     0   46.0  127.   0     14.1  0        0   0  
##  7 080004 021         0  31.8   0     0   31.8    0    0      0    0        0   0  
##  8 080005 071         0   0     0    14.1  5.09  31.8  0      0    0        0  31.8
##  9 080005 243         0   0    14.1   0   14.1    0    5.09   0    0        0   0  
## 10 080005 278         0 286.   31.8   0    0     31.8  0      0    0        0   0  
## # ... with 14,768 more rows, and 2 more variables: CD_65 <dbl>, CD_70 <dbl>
class(trees)
## [1] "tbl_df"     "tbl"        "data.frame"

Tibbles, are not usual data.frames:

  • class tbl_df

  • print only 10 rows by default

  • informs about variable types

  • Besides that, it is like a data.frame (it is a data.frame)

glimpse(trees)
## Observations: 111,756
## Variables: 10
## $ Codi      <fct> 080001, 080002, 080003, 080004, 080006, 080007, 080008, 080009...
## $ Provincia <chr> "08", "08", "08", "08", "08", "08", "08", "08", "08", "08", "0...
## $ Especie   <fct> 022, 476, 021, 021, 021, 021, 243, 045, 243, 022, 021, 021, 07...
## $ Rumbo     <dbl> 7, 38, 25, 28, 19, 32, 40, 16, 47, 44, 13, 9, 9, 25, 199, 6, 4...
## $ Dist      <dbl> 8.30, 9.10, 7.00, 8.89, 11.19, 12.00, 7.80, 5.09, 26.89, 2.70,...
## $ N         <dbl> 31.83, 31.83, 31.83, 31.83, 14.14, 14.14, 31.83, 31.83, 5.09, ...
## $ CD        <dbl> 20, 35, 25, 15, 35, 35, 15, 20, 65, 15, 20, 30, 45, 20, 35, 30...
## $ DiamIf3   <dbl> 20.30, 34.00, 24.80, 16.85, 34.05, 33.10, 15.00, 17.50, 67.40,...
## $ DiamIf2   <dbl> 18.90, 32.45, 17.55, 12.65, 30.90, 28.15, 13.25, 15.30, 66.80,...
## $ HeiIf3    <dbl> 9.00, 9.00, 11.00, 9.50, 13.00, 10.00, 6.00, 7.00, 16.50, 9.50...

dplyr

dplyr is a grammar for data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

The 5 main verbs of dplyr

  • filter: keep/select the rows that match a given condition

  • select: keep columns by name

  • arrange: sort rows acording to a given variable

  • mutate: transform existent variables or create new ones

  • summarise: do some summary statistics and reduce data

Common structure

Most of the tidyverse verbs/functions share certain elements in common:

verb(data, ...)
  • First argument: data (as data.frame or tbl_df)
  • The rest of arguments specify what to do with the data frame
  • Output is always another data frame (tbl_df or data.frame)
  • Unless we are assigning (<-), functions never modifies the original data

Selecting rows (filter)

Use filter() to choose rows/cases/observations where a given condition(s) is true.

Select trees taller than 6 meters:

filter(trees, HeiIf3 > 6)
## # A tibble: 100,247 x 10
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7  8.3   31.8     20    20.3    18.9    9  
##  2 080002 08        476        38  9.1   31.8     35    34      32.4    9  
##  3 080003 08        021        25  7     31.8     25    24.8    17.6   11  
##  4 080004 08        021        28  8.89  31.8     15    16.8    12.6    9.5
##  5 080006 08        021        19 11.2   14.1     35    34.0    30.9   13  
##  6 080007 08        021        32 12     14.1     35    33.1    28.2   10  
##  7 080009 08        045        16  5.09  31.8     20    17.5    15.3    7  
##  8 080010 08        243        47 26.9    5.09    65    67.4    66.8   16.5
##  9 080013 08        022        44  2.7  127.      15    15.1    12.6    9.5
## 10 080014 08        021        13  8.89  31.8     20    19.2    14.2   11.5
## # ... with 100,237 more rows

Select trees in province 50, Zaragoza:

filter(trees, Provincia == '50')
## # A tibble: 0 x 10
## # ... with 10 variables: Codi <fct>, Provincia <chr>, Especie <fct>, Rumbo <dbl>,
## #   Dist <dbl>, N <dbl>, CD <dbl>, DiamIf3 <dbl>, DiamIf2 <dbl>, HeiIf3 <dbl>
filter(trees, CD %in% c(45, 70))
## # A tibble: 2,552 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080016 08        071         9 26.9   5.09    45    46.0    45     19  
##  2 080113 08        021        19 11.1  14.1     70    72.6    70.7   20.9
##  3 080686 08        042        43  7.09 31.8     45    44.3    42.5   12.5
##  4 080721 08        042        16 15.8   5.09    45    46.2    39.8   17.5
##  5 080743 08        042         3  9.69 31.8     45    46.1    40.7   25.5
##  6 081271 08        024        93 10.8  14.1     45    44.0    39.0   17.3
##  7 081278 08        024        87 10.1  14.1     45    44.6    38.9   15.1
##  8 081354 08        026        16  8.6  31.8     45    43.4    30.8   16.9
##  9 081402 08        054        51  5.19 31.8     70   102.     95.5   17.7
## 10 081943 08        024         1 10.3  14.1     45    43.2    33.8   11.7
## # ... with 2,542 more rows

Exercise 1

Let's find those plots in IFN3n (`plots` data frame) that:

  1.1 Are located either in Barcelona (08) or Girona (17)
  
  1.2 Were measured **in** January 2001
  
  1.3 It took **more** than 2 hours to measure (7200s)

Selecting columns (select)

Select() it’s the function in dplyr to choose variables from a tbl. It keeps only the variables you mention.

trees
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7  8.3   31.8     20    20.3    18.9    9  
##  2 080002 08        476        38  9.1   31.8     35    34      32.4    9  
##  3 080003 08        021        25  7     31.8     25    24.8    17.6   11  
##  4 080004 08        021        28  8.89  31.8     15    16.8    12.6    9.5
##  5 080006 08        021        19 11.2   14.1     35    34.0    30.9   13  
##  6 080007 08        021        32 12     14.1     35    33.1    28.2   10  
##  7 080008 08        243        40  7.8   31.8     15    15      13.2    6  
##  8 080009 08        045        16  5.09  31.8     20    17.5    15.3    7  
##  9 080010 08        243        47 26.9    5.09    65    67.4    66.8   16.5
## 10 080013 08        022        44  2.7  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows

Selecting a single column/variable:

select(trees, DiamIf3)
## # A tibble: 111,756 x 1
##    DiamIf3
##      <dbl>
##  1    20.3
##  2    34  
##  3    24.8
##  4    16.8
##  5    34.0
##  6    33.1
##  7    15  
##  8    17.5
##  9    67.4
## 10    15.1
## # ... with 111,746 more rows

Selecting all columns but the targeted one:

select(trees, -Codi)
## # A tibble: 111,756 x 9
##    Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 08        022         7  8.3   31.8     20    20.3    18.9    9  
##  2 08        476        38  9.1   31.8     35    34      32.4    9  
##  3 08        021        25  7     31.8     25    24.8    17.6   11  
##  4 08        021        28  8.89  31.8     15    16.8    12.6    9.5
##  5 08        021        19 11.2   14.1     35    34.0    30.9   13  
##  6 08        021        32 12     14.1     35    33.1    28.2   10  
##  7 08        243        40  7.8   31.8     15    15      13.2    6  
##  8 08        045        16  5.09  31.8     20    17.5    15.3    7  
##  9 08        243        47 26.9    5.09    65    67.4    66.8   16.5
## 10 08        022        44  2.7  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows

Selecting several columns:

select(trees, DiamIf2, DiamIf3)
## # A tibble: 111,756 x 2
##    DiamIf2 DiamIf3
##      <dbl>   <dbl>
##  1    18.9    20.3
##  2    32.4    34  
##  3    17.6    24.8
##  4    12.6    16.8
##  5    30.9    34.0
##  6    28.2    33.1
##  7    13.2    15  
##  8    15.3    17.5
##  9    66.8    67.4
## 10    12.6    15.1
## # ... with 111,746 more rows

Selecting a range of columns:

select(trees, Codi:Dist)
## # A tibble: 111,756 x 5
##    Codi   Provincia Especie Rumbo  Dist
##    <fct>  <chr>     <fct>   <dbl> <dbl>
##  1 080001 08        022         7  8.3 
##  2 080002 08        476        38  9.1 
##  3 080003 08        021        25  7   
##  4 080004 08        021        28  8.89
##  5 080006 08        021        19 11.2 
##  6 080007 08        021        32 12   
##  7 080008 08        243        40  7.8 
##  8 080009 08        045        16  5.09
##  9 080010 08        243        47 26.9 
## 10 080013 08        022        44  2.7 
## # ... with 111,746 more rows

We can leverage some special functions especifically design to work withwith strings to enhance column extraction:

  • starts_with(x): names that start with x
  • ends_with(x): names that end with x
  • contains(x): selects all variables whose name contains x
  • matches(x): selects all variables whose name contains the regular expression x
  • num_range("x", 1:5, width = 2): selects all variables (numerically) from x01 to x05
  • one_of ("x", "y", "z"): selects variables provided in a character vector
  • everything(): selects all variables

Select columns that start with Diam:

select(trees, starts_with('Diam'))
## # A tibble: 111,756 x 2
##    DiamIf3 DiamIf2
##      <dbl>   <dbl>
##  1    20.3    18.9
##  2    34      32.4
##  3    24.8    17.6
##  4    16.8    12.6
##  5    34.0    30.9
##  6    33.1    28.2
##  7    15      13.2
##  8    17.5    15.3
##  9    67.4    66.8
## 10    15.1    12.6
## # ... with 111,746 more rows

Exercise 2

Think of three or four ways to select the variables that define the start and finish date of plot measuring.

Sorting rows (arrange)

arrange(trees, Dist)
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 170672 17        021        73   0    127.    15    17.2    15.2   13.5
##  2 251882 25        025       174   0.3  127.    15    13.5    10.4    6  
##  3 170640 17        045       290   0.3  127.    10    11.4    11.2    6.5
##  4 081544 08        025       176   0.3  127.    20    17.6    14.4    7  
##  5 081080 08        025       107   0.4  127.    15    13.8    10.2    8  
##  6 430170 43        024         0   0.4  127.    25    23.4    15.0    6.5
##  7 430631 43        021        17   0.4  127.    15    15.6    14     10.7
##  8 171941 17        079       150   0.4  127.    20    20.0    10.6   20  
##  9 171990 17        045        34   0.4  127.    25    23.2    19.6    8  
## 10 172056 17        045       160   0.4  127.    15    12.7    11.3    8  
## # ... with 111,746 more rows
arrange(trees, desc(Dist))
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 250008 25        042       398  36.5  5.09    70    74.8    72.3   11.1
##  2 250025 25        031       350  34.8  5.09    70    75.1    71.3   35.7
##  3 250135 25        022       158  33.9  5.09    50    51.8    54.0    9  
##  4 250051 25        021       110  33.6  5.09    60    58.2    56.8    6.5
##  5 250025 25        031       238  33.5  5.09    50    51      45.5   20.3
##  6 250010 25        071       214  32.9  5.09    70    85.3    71.6   21  
##  7 251788 25        045       163  32.8  5.09    60    61.6    51.7   10.5
##  8 250039 25        031       119  32.5  5.09    65    63.5    62.6   30  
##  9 250025 25        031       390  32.1  5.09    45    45.6    43.8   24.6
## 10 171289 17        071        38  31.4  5.09    60    60.8    50.4   26.9
## # ... with 111,746 more rows

Exercise 3

  3.1 Sort plots by date and hour of measurement
  
  3.2 Which plots were started to be measured later in the day?
  
  3.3 Which plots took longer to be measured?

Transforming variables (mutate)

mutate() allows to create/calculate new columns/variables in our dataset.

mutate(
  trees,
  Dist = Dist * 100
)
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7  830.  31.8     20    20.3    18.9    9  
##  2 080002 08        476        38  910   31.8     35    34      32.4    9  
##  3 080003 08        021        25  700   31.8     25    24.8    17.6   11  
##  4 080004 08        021        28  889   31.8     15    16.8    12.6    9.5
##  5 080006 08        021        19 1119   14.1     35    34.0    30.9   13  
##  6 080007 08        021        32 1200   14.1     35    33.1    28.2   10  
##  7 080008 08        243        40  780   31.8     15    15      13.2    6  
##  8 080009 08        045        16  509   31.8     20    17.5    15.3    7  
##  9 080010 08        243        47 2689    5.09    65    67.4    66.8   16.5
## 10 080013 08        022        44  270  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows
mutate(
  trees,
  Alometria = DiamIf3 / HeiIf3,
  Alometria2 = Alometria * DiamIf2
)
## # A tibble: 111,756 x 12
##    Codi  Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3 Alometria
##    <fct> <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>     <dbl>
##  1 0800~ 08        022         7  8.3   31.8     20    20.3    18.9    9        2.26
##  2 0800~ 08        476        38  9.1   31.8     35    34      32.4    9        3.78
##  3 0800~ 08        021        25  7     31.8     25    24.8    17.6   11        2.25
##  4 0800~ 08        021        28  8.89  31.8     15    16.8    12.6    9.5      1.77
##  5 0800~ 08        021        19 11.2   14.1     35    34.0    30.9   13        2.62
##  6 0800~ 08        021        32 12     14.1     35    33.1    28.2   10        3.31
##  7 0800~ 08        243        40  7.8   31.8     15    15      13.2    6        2.5 
##  8 0800~ 08        045        16  5.09  31.8     20    17.5    15.3    7        2.5 
##  9 0800~ 08        243        47 26.9    5.09    65    67.4    66.8   16.5      4.08
## 10 0800~ 08        022        44  2.7  127.      15    15.1    12.6    9.5      1.59
## # ... with 111,746 more rows, and 1 more variable: Alometria2 <dbl>

Special functions:

  • if_else allows to implement a conditional structure in which we determine the output value when that condition was met and the value when it happens otherwise:
mutate(
  trees,
  Especie = if_else(Especie == '021', 'Pinus sylvestris', 'Other')
)
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie          Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <chr>            <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        Other                7  8.3   31.8     20    20.3    18.9    9  
##  2 080002 08        Other               38  9.1   31.8     35    34      32.4    9  
##  3 080003 08        Pinus sylvestris    25  7     31.8     25    24.8    17.6   11  
##  4 080004 08        Pinus sylvestris    28  8.89  31.8     15    16.8    12.6    9.5
##  5 080006 08        Pinus sylvestris    19 11.2   14.1     35    34.0    30.9   13  
##  6 080007 08        Pinus sylvestris    32 12     14.1     35    33.1    28.2   10  
##  7 080008 08        Other               40  7.8   31.8     15    15      13.2    6  
##  8 080009 08        Other               16  5.09  31.8     20    17.5    15.3    7  
##  9 080010 08        Other               47 26.9    5.09    65    67.4    66.8   16.5
## 10 080013 08        Other               44  2.7  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows

Exercise 4

  4.1 Get diameter growth (in cm) of each tree between IFN2 (DiamIf2)  and IFN3 (DiamIf3)

  4.2 Create two new variables with basal area of each tree (in $m^2$ per hectare), both for IFN2 and IFN3. Which is the species of the fastest-growing tree in basal area?

\[ AB = \frac{\pi}{4} · Diam^{2} · N \]

Reducing variables (summarise)

summarise(trees, mean_if3 = mean(DiamIf3))
## # A tibble: 1 x 1
##   mean_if3
##      <dbl>
## 1     23.4

Summary functions

  • min(x), max(x), quantile(x, p)

  • mean(x), median(x),

  • sd(x), var(x), IQR(x)

  • n(), n_distinct(x)

  • sum(x > 10), mean(x > 10)

Grouped summarise

One of the most common ways in which we create summaries is by grouping on the basis of certain information. That means, we split observations between a number of groups and the calculate the summary statistics.

by_province <- group_by(trees, Provincia)
by_province
## # A tibble: 111,756 x 10
## # Groups:   Provincia [4]
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7  8.3   31.8     20    20.3    18.9    9  
##  2 080002 08        476        38  9.1   31.8     35    34      32.4    9  
##  3 080003 08        021        25  7     31.8     25    24.8    17.6   11  
##  4 080004 08        021        28  8.89  31.8     15    16.8    12.6    9.5
##  5 080006 08        021        19 11.2   14.1     35    34.0    30.9   13  
##  6 080007 08        021        32 12     14.1     35    33.1    28.2   10  
##  7 080008 08        243        40  7.8   31.8     15    15      13.2    6  
##  8 080009 08        045        16  5.09  31.8     20    17.5    15.3    7  
##  9 080010 08        243        47 26.9    5.09    65    67.4    66.8   16.5
## 10 080013 08        022        44  2.7  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows
summarise(
  by_province,
  mean_height_ifn3 = mean(HeiIf3, na.rm = TRUE),
  max_height_ifn3 = max(HeiIf3, na.rm = TRUE),
  min_height_ifn3 = min(HeiIf3, na.rm = TRUE)
)
## # A tibble: 4 x 4
##   Provincia mean_height_ifn3 max_height_ifn3 min_height_ifn3
##   <chr>                <dbl>           <dbl>           <dbl>
## 1 08                   11.2             35               1.6
## 2 17                   11.4             38               1.5
## 3 25                   11.4             35.8             2  
## 4 43                    9.21            33               2
summarise(
  by_province,
  R = cor(HeiIf3, DiamIf3)
)
## # A tibble: 4 x 2
##   Provincia     R
##   <chr>     <dbl>
## 1 08        0.621
## 2 17        0.600
## 3 25        0.646
## 4 43        0.700

Exercise 5

  5.1 Which species showed the largest increase in diameter between IFN2 and IFN3?

  5.2 Which province gather the largest number of species?

summarise_if and summarise_at

We can apply a summarising function to a group of variables that share some commmon characteristics that can be tested (i.e. numeric variables)

summarise_if(trees, is.numeric, mean)
## # A tibble: 1 x 7
##   Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
## 1  198.  7.78  55.6  23.4    23.4    20.3   11.1

We can limit the summary by name or using the select helpers (starts_with, ends_with, one_of)

summarise_at(trees, vars(starts_with('Diam')), mean)
## # A tibble: 1 x 2
##   DiamIf3 DiamIf2
##     <dbl>   <dbl>
## 1    23.4    20.3

Targeted transformations

The same can be done with mutate:

mutate_if(trees, is.numeric, log)
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022      1.95 2.12   3.46  3.00    3.01    2.94   2.20
##  2 080002 08        476      3.64 2.21   3.46  3.56    3.53    3.48   2.20
##  3 080003 08        021      3.22 1.95   3.46  3.22    3.21    2.87   2.40
##  4 080004 08        021      3.33 2.18   3.46  2.71    2.82    2.54   2.25
##  5 080006 08        021      2.94 2.42   2.65  3.56    3.53    3.43   2.56
##  6 080007 08        021      3.47 2.48   2.65  3.56    3.50    3.34   2.30
##  7 080008 08        243      3.69 2.05   3.46  2.71    2.71    2.58   1.79
##  8 080009 08        045      2.77 1.63   3.46  3.00    2.86    2.73   1.95
##  9 080010 08        243      3.85 3.29   1.63  4.17    4.21    4.20   2.80
## 10 080013 08        022      3.78 0.993  4.85  2.71    2.71    2.53   2.25
## # ... with 111,746 more rows
mutate_at(
  trees, vars(one_of(c('Especie', 'Species'))),
  ~ paste0('sp_', .x)
)
## Warning: Unknown columns: `Species`
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <chr>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        sp_022      7  8.3   31.8     20    20.3    18.9    9  
##  2 080002 08        sp_476     38  9.1   31.8     35    34      32.4    9  
##  3 080003 08        sp_021     25  7     31.8     25    24.8    17.6   11  
##  4 080004 08        sp_021     28  8.89  31.8     15    16.8    12.6    9.5
##  5 080006 08        sp_021     19 11.2   14.1     35    34.0    30.9   13  
##  6 080007 08        sp_021     32 12     14.1     35    33.1    28.2   10  
##  7 080008 08        sp_243     40  7.8   31.8     15    15      13.2    6  
##  8 080009 08        sp_045     16  5.09  31.8     20    17.5    15.3    7  
##  9 080010 08        sp_243     47 26.9    5.09    65    67.4    66.8   16.5
## 10 080013 08        sp_022     44  2.7  127.      15    15.1    12.6    9.5
## # ... with 111,746 more rows

Pipes and data pipelines (%>%)

Often, we will need to use several verbs (filter, arrange, group_by, summarise…) to manage our data towards a final result. However, multiple operations are difficult to read, or require to create multiple intermediate objects to achieve our goal:

In programming languages this is usually solved nesting functions one inside another, so that we applie and ordered sequence of instructions to our data:

diam_especie <- filter(
  summarise(
    group_by(
      filter(
        trees, !is.na(DiamIf3)
      ),
      Codi, Especie
    ),
    diam = mean(DiamIf3),
    n = n()
  ),
  n > 5
)
no_na_trees <- filter(
  trees, !is.na(DiamIf3)
)
no_na_trees_grouped <- group_by(
  no_na_trees, Codi, Especie
)
summarised_no_na_trees <- summarise(
  no_na_trees_grouped,
  diam = mean(DiamIf3), n = n()
)
final_data <- filter(
  summarised_no_na_trees, n > 5
)

The tidy approach solves this issue, producing a way better coding experience (cleaner and easy to read) by means of its alternative pipe operator (%>%) from magrittr package

The name pipe refers to the ability of connecting the input on the left side passing it to the function in the right as its first argument:

f(x, y) is the same as x %>% f(y)
f(x, y, z) is the same as x %>% f(y, z)

Summing up, in the tidyverse %>% makes each function to be applied to the data frame resulting from the previous step.

filter(df, color == 'blue') is the same as df %>% filter(color == 'blue')
mutate(df, double = 2*value) is the same as df %>% mutate(double = 2*value)

Untidy

diam_especie <- filter(
  summarise(
    group_by(
      filter(
        trees, !is.na(DiamIf3)
      ),
      Codi, Especie
    ),
    diam = mean(DiamIf3),
    n = n()
  ),
  n > 5
)

Tidy

diam_especie <- trees %>%
  filter(!is.na(DiamIf3)) %>%
  group_by(Codi, Especie) %>%
  summarise(
    diam = mean(DiamIf3),
    n = n()
  ) %>%
  filter(n > 5)

Exercise 6

Create pipelines to answer the following questions:

  6.1 Which **plots** have the fastest average growth rate?
  
  6.2 Which is the plot with the **most species**?
  
  6.3 Is there any **relationship** between both variables? <br>
  *(Optional, some knowledge on `ggplot`is required)*

Grouped mutate/filter

We will commonly use groups (group_by) when summarising variables (n inputs, one output):

group_by(Especie) %>% summarise(mean = mean(Diam))

Sometimes, however, we may be interested in calculating new variables by group, but without reducing the dimensions:

trees %>%
  group_by(Especie) %>%
  mutate(
    std_diam = DiamIf3 - mean(DiamIf3)
  )
## # A tibble: 111,756 x 11
## # Groups:   Especie [91]
##    Codi   Provincia Especie Rumbo  Dist      N    CD DiamIf3 DiamIf2 HeiIf3 std_diam
##    <fct>  <chr>     <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
##  1 080001 08        022         7  8.3   31.8     20    20.3    18.9    9     -6.71 
##  2 080002 08        476        38  9.1   31.8     35    34      32.4    9     13.5  
##  3 080003 08        021        25  7     31.8     25    24.8    17.6   11     -0.555
##  4 080004 08        021        28  8.89  31.8     15    16.8    12.6    9.5   -8.50 
##  5 080006 08        021        19 11.2   14.1     35    34.0    30.9   13      8.70 
##  6 080007 08        021        32 12     14.1     35    33.1    28.2   10      7.75 
##  7 080008 08        243        40  7.8   31.8     15    15      13.2    6     -5.86 
##  8 080009 08        045        16  5.09  31.8     20    17.5    15.3    7      1.40 
##  9 080010 08        243        47 26.9    5.09    65    67.4    66.8   16.5   46.5  
## 10 080013 08        022        44  2.7  127.      15    15.1    12.6    9.5  -11.9  
## # ... with 111,746 more rows

Exercise 7

  7.1 Identify those trees that grew most as compared to the average in that plot <br>
  .font80[(Hint: calculate growth, *then* mean growth by plot, and *then* the difference)]
  
  7.2 Identify those plots where a species grows much more than the average for the species

Extra (in case you get bored):

  7.3 Select IFN plots with pure Pinus nigra stands (Especie = 025). Note: we consider a forest to be monospecific when > 80% in BA corresponds to a single species

Joining information from two tables

Sometimes, especially when information comes from database structures, we will find our information split in two or more tables that share a common variable (primary/external key).

dplyroffers several choices to merge information, depending on our specific needs.

left_join(x, y): Add observations in y that also appears in x. Original observations (x) are not lost:

right_join(x, y)`: Add observations in x that also appears in y. Original observations (y) are not lost:

full_join(x, y): All observations, x and y:

inner_join(x, y): Only observations present in both x and y:

Exercise 8

Add X and Y coordinates that are included in the `coordinates` data frame to each plot in
the `plots` data frame

tidyr

Data in tidy format eases the processing and analysis, particularly in vectorized languages as R.

But data is not always properly organized and may incurr in several issues that harm its analysis. In the next example, the name of the colums are values rather than actual variables.

## # A tibble: 5,769 x 22
##    iso2   year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514  f014 f1524
##    <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 AD     1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  2 AD     1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  3 AD     1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  4 AD     1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  5 AD     1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  6 AD     1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  7 AD     1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA     0     1
##  8 AD     1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA     0     1
##  9 AD     1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA    NA    NA
## 10 AD     1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
## # ... with 5,759 more rows, and 6 more variables: f2534 <int>, f3544 <int>, f4554 <int>,
## #   f5564 <int>, f65 <int>, fu <int>

The package tidyr brings in a number of functions to organize our data. This often involves converting table-like formats between wide to long formats.

Wide versus long data

The tidyverse and tidy data

Data in tidy format eases the processing and analysis, particularly in vectorized languages as R. Data has to meet some requirement in order to be considered tidy:

  • Each row represents an observation
  • Columns comprise variables
  • Echa combination of row and column gathers a value

There are 5 examples of messy data we will explore here:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.

Illustrating the difference between wide and long datasets is easiest using an example:

country_long <- data.frame(
    expand.grid(country = c("Sweden", "Denmark", "Norway"), year = 1994:1996),
    avgtemp = round(runif(9, 3, 12), 0)
    )
country_long
##   country year avgtemp
## 1  Sweden 1994      11
## 2 Denmark 1994       4
## 3  Norway 1994       6
## 4  Sweden 1995      12
## 5 Denmark 1995      11
## 6  Norway 1995       3
## 7  Sweden 1996       4
## 8 Denmark 1996       5
## 9  Norway 1996      11
country_wide <- data.frame(
    country = c("Sweden", "Denmark", "Norway"),
    avgtemp.1994 = country_long$avgtemp[1:3],
    avgtemp.1995 = country_long$avgtemp[4:6],
    avgtemp.1996 = country_long$avgtemp[7:9])
country_wide 
##   country avgtemp.1994 avgtemp.1995 avgtemp.1996
## 1  Sweden           11           12            4
## 2 Denmark            4           11            5
## 3  Norway            6            3           11

The long dataset separates the unit of analysis (country-year) into two separate variables while the wide dataset combines one of the keys (year) with the value variable (avgtemp).

A case for long data

There are many reasons to prefer datasets structured in long form. Repeating some of the points made in Wickham (2014), here are three reasons why you should attempt to structure your data in long form:

  1. If you have many value variables, it is difficult to summarize wide-form datasets at a glance (which in turn makes it hard to identify mistakes in the data). For example, imagine we have a dataset with 50 years and 10 value variables of interest - this would result in 500 columns in wide form. Summarizing each column to look for strange observations, or simply understanding which variables are included in the dataset, becomes difficult in this case.

  2. Structuring data as key-value pairs - as is done in long-form datasets - facilitates conceptual clarity. For example, in country_long above, it is clear that the unit of analysis is country-year - or, put differently, that the variables country and year jointly constitute the key in the dataset. In wide-form datasets, one of the variables that constitutes the unit of analysis is mixed with a variable that holds values. (Read more about this in Hadley’s paper referenced above.)

  3. Long-form datasets are often required for advanced statistical analysis and graphing. For example, if you wanted to run a regression with year and/or country fixed effects, you would have to structure your data in long form. Furthermore, many graphing packages, including ggplot2, rely on your data being in long form.

The function gather enables the conversion from wide to long.

## # A tibble: 35,750 x 5
##    iso2   year sex   age_group     n
##    <chr> <int> <chr> <chr>     <int>
##  1 AD     1996 f     014           0
##  2 AD     1996 f     1524          1
##  3 AD     1996 f     2534          1
##  4 AD     1996 f     3544          0
##  5 AD     1996 f     4554          0
##  6 AD     1996 f     5564          1
##  7 AD     1996 f     65            0
##  8 AD     1996 m     014           0
##  9 AD     1996 m     1524          0
## 10 AD     1996 m     2534          0
## # ... with 35,740 more rows
species
## # A tibble: 14,778 x 15
## # Groups:   Codi, Especie [14,778]
##    Codi   Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 CD_65 CD_70
##  * <fct>  <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 080001 022         0 159.   31.8 111.  60.1   19.2  5.09   0    0        0   0       0     0
##  2 080002 021         0   0     0     0    0     74.2 28.3   63.7  0        0   0       0     0
##  3 080002 022         0   0     0   173.  31.8    0    0      0    0        0   0       0     0
##  4 080002 476         0   0     0     0    0     31.8  0      0    0        0   0       0     0
##  5 080003 021         0   0     0    31.8  0      0    0      0    5.09     0   0       0     0
##  6 080003 022         0 127.    0     0   46.0  127.   0     14.1  0        0   0       0     0
##  7 080004 021         0  31.8   0     0   31.8    0    0      0    0        0   0       0     0
##  8 080005 071         0   0     0    14.1  5.09  31.8  0      0    0        0  31.8     0     0
##  9 080005 243         0   0    14.1   0   14.1    0    5.09   0    0        0   0       0     0
## 10 080005 278         0 286.   31.8   0    0     31.8  0      0    0        0   0       0     0
## # ... with 14,768 more rows

The verbs of tidyr

  • gather: Convert data from wide to long format (columns to id-value pairs)
  • spread: Convert data from long to wide format (id-value pairs to columns)
  • separate: Convert one column in serveral
  • unite: Join several columns in one
n_parcelas <- tibble(
  Prov = c('Lleida', 'Girona', 'Barcelona', 'Tarragona'),
  IFN_2 = c(16, 78, 60, 34),
  IFN_3 = c(18, 79, 67, 36)
)

n_parcelas
## # A tibble: 4 x 3
##   Prov      IFN_2 IFN_3
##   <chr>     <dbl> <dbl>
## 1 Lleida       16    18
## 2 Girona       78    79
## 3 Barcelona    60    67
## 4 Tarragona    34    36

gather(df, key, value, vars)

n_parcelas_tidy <- gather(n_parcelas,IFN, n, IFN_2, IFN_3)
n_parcelas
## # A tibble: 4 x 3
##   Prov      IFN_2 IFN_3
##   <chr>     <dbl> <dbl>
## 1 Lleida       16    18
## 2 Girona       78    79
## 3 Barcelona    60    67
## 4 Tarragona    34    36
n_parcelas_tidy
## # A tibble: 8 x 3
##   Prov      IFN       n
##   <chr>     <chr> <dbl>
## 1 Lleida    IFN_2    16
## 2 Girona    IFN_2    78
## 3 Barcelona IFN_2    60
## 4 Tarragona IFN_2    34
## 5 Lleida    IFN_3    18
## 6 Girona    IFN_3    79
## 7 Barcelona IFN_3    67
## 8 Tarragona IFN_3    36

separate(df, col, into, sep)

n_parcelas_sep <-  separate(n_parcelas_tidy, IFN, c('source', 'version'), sep = '_')
n_parcelas_sep
## # A tibble: 8 x 4
##   Prov      source version     n
##   <chr>     <chr>  <chr>   <dbl>
## 1 Lleida    IFN    2          16
## 2 Girona    IFN    2          78
## 3 Barcelona IFN    2          60
## 4 Tarragona IFN    2          34
## 5 Lleida    IFN    3          18
## 6 Girona    IFN    3          79
## 7 Barcelona IFN    3          67
## 8 Tarragona IFN    3          36

unite(data, col, vars, sep)

n_parcelas_unite <- unite(n_parcelas_sep, IFN, source, version, sep = '_')
n_parcelas_unite
## # A tibble: 8 x 3
##   Prov      IFN       n
##   <chr>     <chr> <dbl>
## 1 Lleida    IFN_2    16
## 2 Girona    IFN_2    78
## 3 Barcelona IFN_2    60
## 4 Tarragona IFN_2    34
## 5 Lleida    IFN_3    18
## 6 Girona    IFN_3    79
## 7 Barcelona IFN_3    67
## 8 Tarragona IFN_3    36

Exercise 9

Use `gather` and `separate` to transform the data frame *species* into a **tidy** format, where each column is a variable and each row an observation.
## # A tibble: 14,778 x 15
## # Groups:   Codi, Especie [14,778]
##    Codi   Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 CD_65 CD_70
##  * <fct>  <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 080001 022         0 159.   31.8 111.  60.1   19.2  5.09   0    0        0   0       0     0
##  2 080002 021         0   0     0     0    0     74.2 28.3   63.7  0        0   0       0     0
##  3 080002 022         0   0     0   173.  31.8    0    0      0    0        0   0       0     0
##  4 080002 476         0   0     0     0    0     31.8  0      0    0        0   0       0     0
##  5 080003 021         0   0     0    31.8  0      0    0      0    5.09     0   0       0     0
##  6 080003 022         0 127.    0     0   46.0  127.   0     14.1  0        0   0       0     0
##  7 080004 021         0  31.8   0     0   31.8    0    0      0    0        0   0       0     0
##  8 080005 071         0   0     0    14.1  5.09  31.8  0      0    0        0  31.8     0     0
##  9 080005 243         0   0    14.1   0   14.1    0    5.09   0    0        0   0       0     0
## 10 080005 278         0 286.   31.8   0    0     31.8  0      0    0        0   0       0     0
## # ... with 14,768 more rows

spread(df, key, value, sep)

n_parcelas2 <-  spread(n_parcelas_unite, IFN, n)
n_parcelas
## # A tibble: 4 x 3
##   Prov      IFN_2 IFN_3
##   <chr>     <dbl> <dbl>
## 1 Lleida       16    18
## 2 Girona       78    79
## 3 Barcelona    60    67
## 4 Tarragona    34    36

]

n_parcelas2
## # A tibble: 4 x 3
##   Prov      IFN_2 IFN_3
##   <chr>     <dbl> <dbl>
## 1 Barcelona    60    67
## 2 Girona       78    79
## 3 Lleida       16    18
## 4 Tarragona    34    36

Exercise 10

Use `unite` and `spread` to transform the data from exercise 9 into its original format

purrr

purrr allows to do functional programming with R, making loops and iterative tasks pipe-friendly and, thus, easier to read.

c(1, 2, 3) %>%
  map_dbl(~ .*2)
## [1] 2 4 6
c('a', 'b', 'c') %>%
  map_chr(~ paste0('treatment_', .))
## [1] "treatment_a" "treatment_b" "treatment_c"

Can be used as an alternative to apply family functions (lapply, vapply…) which are sometimes unclean and difficult to use.

purrr’s common structure

verb(.x, .f, ...)
  • first argument (.x): vector or list (included data frames)
  • second argument (.f): function to apply to each element of .x
  • output is always a vector or a list of the same length as .x

Main functions in purrr:

  • map: Tranforms the input by applying a function to each element and returning a vector the same length as the input.
    • map: Returns always a list
    • map_chr, map_dbl, map_int, map_lgl: Returns a vector of the corresponding type
    • map_dfr, map_dfc: Returns a data frame created by row- or column-binding

Models workflow become easier

trees %>%
  split(.$Provincia) %>%
  map(~ lm(HeiIf3 ~ DiamIf3, data = .)) %>%
  map_dfr(~ broom::tidy(.))
## # A tibble: 8 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    5.04    0.0464      109.        0
## 2 DiamIf3        0.275   0.00192     144.        0
## 3 (Intercept)    4.74    0.0526       90.2       0
## 4 DiamIf3        0.290   0.00211     137.        0
## 5 (Intercept)    5.56    0.0404      137.        0
## 6 DiamIf3        0.234   0.00147     160.        0
## 7 (Intercept)    3.75    0.0614       61.2       0
## 8 DiamIf3        0.238   0.00247      96.4       0

References

Ameztegui, A., and V. Granda. 2019. “Introduction to the Tidyverse.” GitHub Repository. https://github.com/ameztegui/SIBECOL_workshop; GitHub.

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.

Marcos Rodrigues

Aitor Ameztegui

Victor Granda

2020/02/03